Scientific Progress

The PI studied all the mathematical literature he could find related to the Google search engine, the Google matrix, and PageRank, as well as the Yahoo search engine and the classic SearchKing HITS algorithm. The co-PI immersed herself in the sociology literature for relevant studies on social networks, strong and weak social ties, and related topics which may be useful for the study of search algorithms. Our findings can be summarized as follows.

1) A review of the related mathematical literature is given in Section 2.2 of Appendix I.

2) In addition, the PI presented a new proof of the convergence of the Google search algorithm, although several proofs are already available. See Section 2.3 of Appendix I. This analysis is used for the new algorithms based on sociological analysis discussed in 6) below.

3) The PI proposed a new local update algorithm. However, its computational efficiency has not yet been compared with that of other update algorithms. See Section 2.4 of Appendix I.

4) The PI also studied how to improve the efficiency of the computation in the hope of speeding up the search. Given the explosive growth in the number of webpages and Internet users, such a study is necessary. However, the study has not led to any meaningful results so far.

5) The PI discussed with the co-PI web search algorithms in the sociological sense, which is one of the key points of this project. The PI found that the Bonacich centrality measure is the same as the one used in the Google search algorithm if the relation matrix is the adjacency matrix with an appropriate normalization constant. See Section 3 of Appendix I.

6) The co-PI suggested using not only the out-linkages but also the in-linkages in the search algorithm. In addition, the co-PI suggested using the secondary linkages to further help determine the relevance in PageRank. Based on these ideas, the PI has proposed a new version of the search algorithm which uses both in- and out-linkages, and proved that the modified algorithm is still convergent and that the convergence rate is determined by the damping factor. See Section 4 of Appendix I.

7) The co-PI reviewed a large body of sociology literature related to PageRank, relevance scores for linkages (in particular, social linkages), strong and weak social ties, social cohesion, position, and distance. See Appendix II.

8) The co-PI was willing to write a section explaining, in sociological terminology, why one needs to use both in- and out-linkages. However, the PI had not received such a write-up as of Jan. 15, 2013.

9) Finally, the PI studied the Yahoo search algorithm based on its patent. Yahoo has since updated its search algorithm, and the PI did not find enough time to study the current version. See Section 5 of Appendix I.

Appendix  I: 

Ming-Jun  Lai  and  Dawn  Robinson,  A  Mathematical  and  Sociological  Analysis  of  Google  Search  Algorithm,  a  technical  report, 
2013. 

Appendix  II: 

Dawn  Robinson,  A  Summary  of  Sociological  Concepts  Related  to  Social  Network  and  Its  Techniques  for  Quantifying  Social 
Cohesion,  Social  Position,  Social  Distance,  a  technical  report,  2013. 
1 Research Activities:

Dr. Ming-Jun Lai, the PI, and Dr. Dawn Robinson, the co-PI, had a few meetings in Lai's office and one meeting in Robinson's office during Aug. and Sept. 2011 to discuss how to proceed with the research. They exchanged several emails before and after these meetings. Later, in March 2012, they had another meeting on the distribution of the grant money: one summer month of salary for the PI and one summer month of salary for the co-PI. The PI used all the funds allocated for travel: part of the travel fund was used for the summer salary of Mr. George Slavov, one of the PI's graduate students, and part was used for Mr. Leopold Matamba Messi's travel to UCLA for a conference. Mr. Matamba Messi is another graduate student of the PI. He graduated in Aug. 2012 and found a post-doc position at the Mathematical Biology Institute at Ohio State University. Since Aug. 2012, the PI has emailed the co-PI about the final report several times. The PI and co-PI met again on Dec. 5, 2012, during a workshop and discussed how to write this report. They met again in Lai's office on Jan. 16, 2013, for submission of the final report.

2  Research  Results: 

The PI studied all the mathematical literature he could find related to the Google search engine, the Google matrix, and PageRank, as well as the Yahoo search engine and the classic SearchKing HITS algorithm. The co-PI immersed herself in the sociology literature for relevant studies on social networks, strong and weak social ties, and related topics which may be useful for the study of search algorithms. Our findings can be summarized as follows.

1) A review of the related mathematical literature is given in Section 2.2 of Appendix I.

2) In addition, the PI presented a new proof of the convergence of the Google search algorithm, although several proofs are already available. See Section 2.3 of Appendix I. This analysis is used for the new algorithms based on sociological analysis discussed in 6) below.

*This project is supported by Army Research Office Grant # W911NF-11-1-0322, August 23, 2011-Aug. 23, 2012.
†Dept. of Mathematics, University of Georgia, Athens, GA 30602. Email address: mjlai@math.uga.edu
‡Dept. of Sociology, University of Georgia, Athens, GA 30602. Email address: sodawn@uga.edu
3) The PI proposed a new local update algorithm. However, its computational efficiency has not yet been compared with that of other update algorithms. See Section 2.4 of Appendix I.

4) The PI also studied how to improve the efficiency of the computation in the hope of speeding up the search. Given the explosive growth in the number of webpages and Internet users, such a study is necessary. However, the study has not led to any meaningful results so far.

5) The PI discussed with the co-PI web search algorithms in the sociological sense, which is one of the key points of this project. The PI found that the Bonacich centrality measure is the same as the one used in the Google search algorithm if the relation matrix is the adjacency matrix with an appropriate normalization constant. See Section 3 of Appendix I.

6) The co-PI suggested using not only the out-linkages but also the in-linkages in the search algorithm. In addition, the co-PI suggested using the secondary linkages to further help determine the relevance in PageRank. Based on these ideas, the PI has proposed a new version of the search algorithm which uses both in- and out-linkages, and proved that the modified algorithm is still convergent and that the convergence rate is determined by the damping factor. See Section 4 of Appendix I.

7) The co-PI reviewed a large body of sociology literature related to PageRank, relevance scores for linkages (in particular, social linkages), strong and weak social ties, social cohesion, position, and distance. See Appendix II.

8) The co-PI was willing to write a section explaining, in sociological terminology, why one needs to use both in- and out-linkages. However, the PI had not received such a write-up as of Jan. 15, 2013.

9) Finally, the PI studied the Yahoo search algorithm based on its patent. Yahoo has since updated its search algorithm, and the PI did not find enough time to study the current version. See Section 5 of Appendix I.

3  Conclusion: 

The PI has surveyed all the mathematical literature related to search engines up to the present, and the co-PI has summarized the sociological concepts related to internet search: not only the relevance among webpages, but also the closeness and betweenness among people, groups, etc. The goal of this project is to understand PageRank sociologically, how it could be adjusted to be more effective sociologically, and how to make the algorithms more efficient. This is still a central issue. The PI was not able to complete this goal, as the PI and co-PI only managed to find time to survey the current literature in the mathematical and sociological senses and did not find enough time to discuss their connections. They made a small amount of progress towards the goal, as mentioned above. The PI would like to continue this project if the ARO is willing to fund it again. Through this project, the PI has gained an excellent knowledge of the mathematical study of the Google search algorithm and other search algorithms. He is ready to attack research problems related to these algorithms, such as how to speed up the computation and how to update the Google matrix and its PageRank vector in parallel. He is also now able to work better with a sociologist, as he knows many sociological definitions and notations. See Items 5) and 6) above as evidence.
With more time, this research team could digest the existing information more deeply and propose new algorithms, making use of sociological insights to improve the mathematical characterization of relations in large, complex systems. The PI and co-PI plan to continue their joint work toward a better understanding of the search algorithms and toward discovering how to better situate them in the larger contexts of existing mathematical and sociological knowledge.


Appendix  I 

Ming-Jun Lai and Dawn Robinson, A Mathematical and Sociological Analysis of Google Search Algorithm, a manuscript, 2013.


Appendix  II 

Dawn  Robinson,  A  Summary  of  Sociological  Concepts  Related  to  Social  Network  and  Its  Techniques  for 
Quantifying  Social  Cohesion,  Social  Position,  Social  Distance,  a  manuscript,  2013. 
A Mathematical and Sociological Analysis
of Google Search Algorithm *

Ming-Jun Lai† Dawn Robinson‡

January  16,  2013 


Abstract 

The Google search algorithm for finding relevant information on-line based on keywords, phrases, links, and webpages is analyzed in mathematical and sociological settings in this article. We first survey the mathematical studies related to the Google search engine and then present a new analysis of the convergence of the search algorithm and a new update scheme. Next, based on sociological knowledge, we propose to use both in- and out-linkages, as well as second-order linkages, to refine and improve the search algorithm. We use sociology to justify our proposed improvements and mathematically prove the convergence of these two new search algorithms.


1  Introduction 

Due to the huge volume of documents available through the World Wide Web, a search engine or similar service is now absolutely necessary in order to find relevant information on-line. About 12 years ago, Google, Yahoo and other companies started providing such a service. It is interesting to know the mathematics of the computational algorithms behind these search engines, in particular the Google search engine, which is so successful nowadays. Many mathematicians and computer scientists, including G. Golub and his collaborators, Brezinski and his collaborators, and Langville and Meyer, among other pioneers, worked on the mathematics of the Google matrix and its computation. See [25], [16], [9], [8], [31], [32] and a recent survey in [2]. The search algorithm has also become a very useful tool in many Web search technologies such as spam detection, crawler configuration, and trust networks, and has found many applications; see [12], [13], etc. Today, it is the most used computational algorithm in the world. More and more websites such as Facebook, Amazon, and Netflix use similar search algorithms for various purposes.

As so many people use Google search once or several times daily, it is interesting to see whether the search results make sense sociologically, and how the algorithm could be adjusted to be more effective sociologically. We divide this article into two parts. We first present a mathematical description of the computational algorithm behind the Google search engine and summarize the recent studies related to the algorithm. Then we present a sociological analysis. Based on the sociology, we propose two modifications intended to make the algorithm more effective.

*This research is supported by Army Research Office Grant # W911NF-11-1-0322.

†mjlai@math.uga.edu. Department of Mathematics, The University of Georgia, Athens, GA 30602.

‡sodawn@uga.edu. Department of Sociology, The University of Georgia, Athens, GA 30602.
2 Mathematical Analysis


In this section we first survey the mathematical studies related to the Google search algorithm. Then we present a mathematical analysis which is new to the best of our knowledge. We mainly study convergence: the algorithm is an iterative method, since the Google matrix is so large that any direct method is infeasible. We establish the convergence, present the rate of convergence, and give a local updating scheme.

2.1  Google  Search  Engine 

During each search, after a keyword or keywords are entered, the Google search engine produces a ranking vector called PageRank. It is a vector of huge size (several billion entries), although almost all entries are zero in general. Each entry of the vector represents a webpage. A nonzero value in an entry is a weighting factor deciding how important the page is in relation to the keyword(s) just entered. The importance is determined by popularity. The ideas for this computational algorithm were reported as early as 1997-1998 by Page in [37] and Brin and Page in [10]. See also [33] for an explanation of the ideas. The algorithm mainly computes a numerical weighting (called a pagerank) for each webpage or each element in a hyperlinked set of documents. Pageranks reflect Google's view of the importance of webpages among more than several billion webpages. It is a pragmatic approach to improving search quality by going through the collective intelligence of the web to determine a page's importance.

Let $\mathbf{v}$ be a vector in $\mathbb{R}^N$ with $N > 8$ billion. Any standard unit vector in $\mathbb{R}^N$ is associated with a webpage. Mathematically, PageRank is a probability function of vectors. For a page $\mathbf{v}$, let $P(\mathbf{v})$ be the PageRank of $\mathbf{v}$. By a stroke of ingenuity, Brin and Page proposed that $P(\mathbf{v})$ is the weighted sum of the PageRanks of all pages pointing to $\mathbf{v}$. That is,

\[ P(\mathbf{v}) = \sum_{\mathbf{u} \in \mathcal{A}_{\mathbf{v}}} P(\mathbf{u})\,\ell(\mathbf{u},\mathbf{v}), \tag{1} \]

where $\mathcal{A}_{\mathbf{v}}$ is the collection of all pages connected to $\mathbf{v}$ and $\ell(\mathbf{u},\mathbf{v}) \in [0,1]$ is a weight dependent on the number of links from page $\mathbf{u}$ to the other pages. More precisely, let $L(\mathbf{u})$ be the number of linkages from $\mathbf{u}$ and define $\ell(\mathbf{u},\mathbf{v}) = 1/L(\mathbf{u})$ if $\mathbf{u} \in \mathcal{A}_{\mathbf{v}}$ and $0$ otherwise. Such a definition is cyclic. Also, a web surfer may get bored and abandon the current search by teleporting to a new page. By a second stroke of ingenuity, Brin and Page introduced a damping factor and modified the PageRank function $P(\mathbf{v})$ to be

\[ P(\mathbf{v}) = d \sum_{\mathbf{u} \in \mathcal{A}_{\mathbf{v}}} P(\mathbf{u})\,\ell(\mathbf{u},\mathbf{v}) + \frac{1-d}{N}, \quad \forall \text{ unit vector } \mathbf{v} \in \mathbb{R}^N, \tag{2} \]

where $d \in (0,1)$ is a damping factor. It indeed leads to a convergent algorithm, and the PageRank $P$ will be uniquely defined. In summary, we have

• PageRank

\[ \mathbf{P} = [P(\mathbf{v}_1), \cdots, P(\mathbf{v}_N)]^\top. \tag{3} \]

• $M = [\ell(\mathbf{v}_i, \mathbf{v}_j)]_{1 \le i,j \le N}$ is the adjacency matrix of size $N \times N$ with the convention $\ell(\mathbf{v}_i, \mathbf{v}_i) = 0$.

• $\mathbf{1} = [1, 1, \cdots, 1]^\top$.

• The Google Matrix

\[ G = dM + \frac{1-d}{N}\,\mathbf{1}\mathbf{1}^\top, \tag{4} \]

which is a stochastic matrix, where $d > 0$ is a damping factor. In the Google search engine, $d = 0.85$.

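To compute the PageRank $\mathbf{P}$, the Google search engine starts with an initial vector $\mathbf{P}_1$, e.g., $\mathbf{P}_1 = \mathbf{1}/N$, and computes

\[ \mathbf{P}_{k+1} = G\,\mathbf{P}_k, \quad \forall k = 1, 2, \cdots \tag{5} \]

until the difference between $\mathbf{P}_{k+1}$ and $\mathbf{P}_k$ is within a given tolerance.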
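The iteration (5) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Google's implementation: the four-page link graph and the helper names `column_stochastic` and `pagerank_power` are hypothetical, every page is assumed to have at least one out-link, and `M[i][j]` is taken to be 1/L(v_j) when page j links to page i, so that each column of M sums to one. Under these assumptions the entries of each iterate sum to one, and applying G reduces to d M P_k + (1-d)/N.

```python
def column_stochastic(links, n):
    """Build M with M[i][j] = 1/L(j) if page j links to page i, else 0 (assumed convention)."""
    M = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            M[i][j] = 1.0 / len(targets)
    return M


def pagerank_power(M, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate P_{k+1} = d*M*P_k + (1-d)/N until the l1 change drops below tol."""
    n = len(M)
    P = [1.0 / n] * n  # uniform starting vector 1/N
    for _ in range(max_iter):
        P_next = [
            d * sum(M[i][j] * P[j] for j in range(n)) + (1.0 - d) / n
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(P_next, P)) < tol:
            return P_next
        P = P_next
    return P


# Hypothetical 4-page web: page -> pages it links to (no dangling pages, no self-links).
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
M = column_stochastic(links, 4)
P = pagerank_power(M)
print("PageRank:", P)
print("sum of entries:", sum(P))
```

Page 3 receives no links in this toy graph, so its score stays at the teleportation floor (1-d)/N.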

For an elementary explanation of the Google computational algorithm, see the monograph [33]. Many mathematical problems arise from the algorithm. For example, does the computation (5) converge, and what is the convergence rate? Is the computation stable? How can the iterations in (5) be sped up? How can the computation be done in parallel? Can we have $d = 1$? How can we add a webpage to, or delete one from, the PageRank function?

2.2  Summary  of  Mathematical  Results 

The major question for the computational algorithm (5) in the previous subsection is the convergence rate of the iterations. The iteration is a power method. The convergence depends on the second largest eigenvalue, as the largest eigenvalue is 1 (cf. [27]). The value of the second largest eigenvalue or eigenvalues is the damping factor, as first shown in [24]. This fact was later established in [39] by using the Jordan canonical form of the Google matrix. It was also proved in [14], where the researcher showed that the second largest eigenvalue is the damping factor even when the Google matrix has more than one largest eigenvalue. It is also proved in [30]. In [11], the researcher interpreted the PageRank algorithm as a power series, which leads to new proofs of several known results about PageRank, e.g., that the convergence rate is exactly the damping factor as shown in [24], and presented a new insight: the PageRank of a page is the probability that a specific type of random surfer ends his walk at that page. Usually, the random surfer is assumed to follow the uniform distribution suggested by Page and Brin. In fact, all the properties of PageRank hold for an arbitrary probability setting. We shall present another proof in the next subsection.

A second question is whether the computation is stable, or what the condition number of the Google matrix is. In [18], the researchers viewed the PageRank vector as the stationary distribution of a stochastic matrix, the Google matrix. They analyzed the sensitivity of PageRank to changes in the Google matrix, including addition and deletion of links in the web graph, and presented error bounds for the iterates of the power method and for their residuals. See also [9] for various properties, including the condition number of the Google matrix.

A central question is how to accelerate the Google computational algorithm. There are several approaches.

• Extrapolation In [8], the researchers studied the computation of the nonnegative left eigenvector associated with the dominant eigenvalue of the PageRank matrix and proposed using extrapolation methods to approximate the eigenvector. Furthermore, in [9], the researchers proposed some acceleration methods to address the slow convergence of the power method for the PageRank vector. The acceleration methods can be explained in terms of the method of Vorobyev moments and Krylov subspace methods. In addition, many properties of the Google matrix and PageRank vectors were presented.




• Sparsity Split In [40], the researchers used the sparsity of the hyperlink matrix (one of the two matrices in the Google matrix) and proposed a linear extrapolation to optimize each iteration of the power method for the computation of PageRank and to reduce the storage space. The idea is simple, and the engineers at Google may have already implemented it.

• Iterative Solutions The iteration in (5) can be reinterpreted as a linear system, given in (8). As explained in [15], the Google matrix is very large, sparse and non-symmetric. Solution of the linear system by a direct method is not feasible due to the matrix size. Many iterative methods are available, such as Gauss-Jacobi iterations, Generalized Minimum Residual (GMRES), Biconjugate Gradient (BiCG), Quasi-Minimal Residual (QMR), Conjugate Gradient Squared (CGS), Biconjugate Gradient Stabilized (BiCGSTAB), and Chebyshev iterations. The researchers proposed using the parallel block Jacobi iteration with adaptive Schwarz preconditioners.

• Parallel Iteration In [15], the researchers described a parallel implementation for PageRank computation and demonstrated that parallel PageRank computing is faster and less sensitive to changes in teleportation.

• Lumping Dangling Nodes In [19], the researchers lumped all of the dangling nodes together into a single node so that the Google matrix can be reduced. They showed that the reduced stochastic matrix has the same nonzero eigenvalues as the full Google matrix and that the convergence rate is the same when the power method is applied. In [34], the researchers further reduced the Google matrix by also lumping together the so-called weakly nondangling nodes, and the reduced matrix has the same nonzero eigenvalues as the Google matrix.

• Distributed Randomized Algorithm In a series of papers ([20], [21], [22]), the researchers applied a distributed randomized algorithm to PageRank. The idea is to let each page compute its own value by exchanging information with its linked pages. The communication among the pages is randomized so as to make it asynchronous. The analysis covers communication in which data transmitted over the links is received correctly as well as communication subject to noise.

There are several other improvements.

One is to update the PageRank. In [33], the researchers considered the PageRank vector to be a state of a homogeneous irreducible Markov chain, where the chain requires updating by altering some of its transition probabilities or by adding or deleting some states. They proposed a general-purpose algorithm which simultaneously deals with both kinds of updating problems and proved its convergence. We shall present another approach for local updating in Section 2.4.

Another improvement is to distinguish webpages with equal rankings. In [35], the researchers proposed an idea for organizing the search results produced by PageRank when several pages have the same rank score, so that the surfer can get more relevant and important results easily. Their idea is to add a weight to each page.

We finally remark that besides the PageRank algorithm, there are many other search algorithms available. For example, the Hypertext Induced Topics Search (HITS) algorithm is a classic algorithm invented by Kleinberg (cf. [28]). Kleinberg viewed most pages as both a hub and an authority, i.e., a general page not only points to other pages but also is pointed to by other pages. Again, the HITS algorithm applies the power method to the hub weight vector and the authority weight vector. See [33] and [6] for more explanation. For another example, Yahoo uses another type of algorithm. See the last section of this paper.
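To make the hub and authority idea concrete, the following is a minimal Python sketch of a HITS-style power iteration. It is an illustration under assumptions rather than Kleinberg's exact formulation: the link graph is hypothetical, the function name `hits` is ours, a fixed number of iterations is used instead of a convergence test, and each weight vector is rescaled to sum to one, a normalization chosen here only for simplicity.

```python
def hits(links, n, iterations=100):
    """Alternate authority and hub updates, rescaling each vector to sum to one."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: total hub weight of the pages that link to j.
        auth = [0.0] * n
        for i, targets in links.items():
            for j in targets:
                auth[j] += hub[i]
        # Hub weight of page i: total authority of the pages that i links to.
        hub = [sum(auth[j] for j in links.get(i, [])) for i in range(n)]
        auth_total, hub_total = sum(auth), sum(hub)
        auth = [a / auth_total for a in auth]
        hub = [h / hub_total for h in hub]
    return hub, auth


# Hypothetical 4-page web: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
hub, auth = hits(links, 4)
print("hub weights:", hub)
print("authority weights:", auth)
```

In this toy graph page 2 is pointed to by three pages and obtains the largest authority weight, while page 3, which points to the two strongest authorities, obtains the largest hub weight.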

In summary, the study of the Google computational algorithm requires many mathematical tools, such as graph theory (cf. [?]), random walks and Markov chains (e.g., [4], [33]), numerical analysis (cf. [16], [15], [8]), stochastic matrices (cf. [18]), network theory (cf. [6]), etc. See [2] for a survey of recent techniques for link-based ranking methods and the PageRank algorithm to find the most relevant documents from the WWW.

2.3  Our  Convergence  Analysis 

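We see that $P(\mathbf{v}) > 0$. We now claim that $P(\mathbf{v}) < 1$. In fact, we can prove the following

Theorem 2.1 Let $\mathbf{v}_i$, $i = 1, \cdots, N$, be the standard unit vectors in $\mathbb{R}^N$. Then the PageRanks $P(\mathbf{v}_i)$, $i = 1, \cdots, N$, satisfy

\[ \sum_{i=1}^{N} P(\mathbf{v}_i) = 1. \]

Proof. Indeed, as in the definition (2), $P(\mathbf{v}_i)$, $i = 1, \cdots, N$, are dependent on each other. We recall that

\[ \mathbf{P} = [P(\mathbf{v}_1), \cdots, P(\mathbf{v}_N)]^\top \]

is the PageRank vector in $\mathbb{R}^N$. We can see that

\[ \sum_{i=1}^{N} \ell(\mathbf{v}_j, \mathbf{v}_i) = 1 \tag{6} \]

for $j = 1, \cdots, N$. We call $\ell(\mathbf{v}_i, \mathbf{v}_j)$ adjacency functions. It follows from (2) that we can write

\[ \mathbf{P} = \frac{1-d}{N}\mathbf{1} + dM\mathbf{P}, \tag{7} \]

where $M = [\ell(\mathbf{v}_i, \mathbf{v}_j)]_{1 \le i,j \le N}$ is the adjacency matrix of size $N \times N$ with the convention $\ell(\mathbf{v}_i, \mathbf{v}_i) = 0$ and $\mathbf{1} = [1, 1, \cdots, 1]^\top \in \mathbb{R}^N$. Hence, we have

\[
\begin{aligned}
\sum_{i=1}^{N} P(\mathbf{v}_i) = \mathbf{1}^\top\mathbf{P} &= \frac{1-d}{N}\mathbf{1}^\top\mathbf{1} + d\,\mathbf{1}^\top M\mathbf{P} \\
&= 1 - d + d\,\mathbf{1}^\top\mathbf{P} = 1 - d + d\sum_{i=1}^{N} P(\mathbf{v}_i),
\end{aligned}
\]

where we have used (6) to have $\mathbf{1}^\top M = \mathbf{1}^\top$. That is, we have $(1-d)\sum_{i=1}^{N} P(\mathbf{v}_i) = 1 - d$ and hence $\sum_{i=1}^{N} P(\mathbf{v}_i) = 1$ since $d \ne 1$. This completes the proof. ■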
The PageRank algorithm offers two ways to calculate the PageRank vector $\mathbf{P}$: one is an algebraic method and the other is an iterative method. Indeed, by (7), we have

\[ (I - dM)\,\mathbf{P} = \frac{1-d}{N}\mathbf{1} \quad\text{or}\quad \mathbf{P} = \frac{1-d}{N}(I - dM)^{-1}\mathbf{1}, \tag{8} \]

where $I$ is the identity matrix of size $N \times N$. This algebraic method will be well-defined if $I - dM$ is invertible. Indeed, we have

Theorem 2.2 Let $I$ be the identity matrix and $M$ be the adjacency matrix defined above. If $d \in (0,1)$, then $I - dM$ is invertible.

Proof. We first recall that a matrix $A = [a_{ij}]_{1 \le i,j \le N}$ is diagonally dominant if

\[ |a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{N} |a_{ij}| \]

for all $i = 1, \cdots, N$. It is easy to see that $I - dM$ is a diagonally dominant matrix by using (6) with $d < 1$. It is known (cf. [27], p. 178) that every diagonally dominant matrix is nonsingular, and hence is invertible. So is $I - dM$. ■
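For a very small example, the algebraic method (8) can be carried out directly. The sketch below is only an illustration under assumptions: the 4-by-4 matrix M is hypothetical, with `M[i][j]` equal to 1/L(v_j) when page j links to page i so that its columns sum to one, and the helper `solve` is a plain Gaussian elimination that is practical only for tiny dense systems, not for a matrix of the size of the Web. The computed entries also illustrate Theorem 2.1, since they sum to one.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (tiny dense systems only)."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy, A is left untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


d, n = 0.85, 4
# Hypothetical column-stochastic M: M[i][j] = 1/L(j) if page j links to page i.
M = [
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
A = [[(1.0 if i == j else 0.0) - d * M[i][j] for j in range(n)] for i in range(n)]  # I - dM
b = [(1.0 - d) / n] * n                                                             # (1-d)/N * 1
P = solve(A, b)
print("PageRank from the linear system:", P)
print("sum of entries:", sum(P))
```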

Certainly, finding the inverse $(I - dM)^{-1}$ is computationally expensive. An easy approach is to use an iterative method, for example, the power method. Starting with $\mathbf{P}_0 = \mathbf{1}/N$, we iteratively compute

\[ \mathbf{P}_k = G\,\mathbf{P}_{k-1} = \frac{1-d}{N}\mathbf{1} + dM\mathbf{P}_{k-1} \tag{9} \]

by using Theorem 2.1 for $k = 1, 2, \cdots$. We now show that $\mathbf{P}_k$, $k \ge 1$, converge. Let us use the $\ell_1$ norm for matrices to measure the errors $\mathbf{P}_k - \mathbf{P}$. Recall that the $\ell_1$ norm of a matrix $A = [a_{ij}]_{1 \le i,j \le N}$ is

\[ \|A\|_1 = \max_{j=1,\cdots,N} \sum_{i=1}^{N} |a_{ij}|. \]

Then we have

Theorem 2.3 Let $\mathbf{P}$ be the PageRank vector defined by the algebraic method. For all $k \ge 1$,

\[ \|\mathbf{P}_k - \mathbf{P}\|_1 \le C d^k, \]

where $C$ is a positive constant.

Proof. It is easy to see from (7) and the definition (9) of the iterations that

\[ \mathbf{P}_k - \mathbf{P} = dM(\mathbf{P}_{k-1} - \mathbf{P}) = \cdots = (dM)^k(\mathbf{P}_0 - \mathbf{P}). \]

Thus, letting $C = \|\mathbf{P}_0 - \mathbf{P}\|_1$,

\[ \|\mathbf{P}_k - \mathbf{P}\|_1 \le d^k \|M^k\|_1 C \le C d^k \|M\|_1^k. \]

Finally we note that $\|M\|_1 = 1$ by using (6) to complete the proof. ■
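The geometric error bound of Theorem 2.3 can be observed numerically on a toy problem. The Python sketch below is an illustration under assumptions: the matrix M is a hypothetical 4-page example whose columns sum to one, the helper names `step` and `l1` are ours, and the reference vector is obtained by simply iterating (9) long enough that the remaining error is negligible in floating point.

```python
def step(M, P, d):
    """One iteration of (9): P -> (1-d)/N * 1 + d*M*P."""
    n = len(P)
    return [d * sum(M[i][j] * P[j] for j in range(n)) + (1.0 - d) / n for i in range(n)]


def l1(u, v):
    """l1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


d, n = 0.85, 4
# Hypothetical column-stochastic M: M[i][j] = 1/L(j) if page j links to page i.
M = [
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]

# Reference solution: iterate long enough that the remaining error is negligible.
P_ref = [1.0 / n] * n
for _ in range(2000):
    P_ref = step(M, P_ref, d)

P = [1.0 / n] * n
C = l1(P, P_ref)  # C = ||P_0 - P||_1
errors = []
for k in range(1, 31):
    P = step(M, P, d)
    errors.append(l1(P, P_ref))
    print(k, errors[-1], C * d ** k)  # observed error next to the bound C * d^k
```

Each printed row places the observed error beside the bound; on this example the observed error should not exceed the bound, up to rounding.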

We remark that the convergence of $\mathbf{P}_k$ to $\mathbf{P}$ was shown numerically in [37] for full-size and half-size link databases. The rates of convergence are nearly the same for these two sizes. Also, in [23], the contributor(s) of this wiki page use the power method for the leading eigenvector to explain the convergence of the iterative method above, without giving a convergence rate, as the rate is based on the ratio of the second largest and the largest eigenvalues of the Google matrix. The second eigenvalue of the Google matrix was studied in [24], where the rate of convergence of PageRank was determined to be 0.85, the damping factor $d$. As mentioned in a previous section, the convergence rate has already been determined by several methods. Our result in Theorem 2.3 gives another proof, in a much simpler and easier fashion than the study in [24].


2.4  A  New  Local  Update  Scheme 

We discuss how to do local updates. In practice, webpages are updated constantly in the sense that the number $L(\mathbf{v}_j)$ of links from webpage $\mathbf{v}_j$ changes very often, and usually increases. The adjacency matrix $M$ needs to be updated every day. Instead of using an updated matrix $M$ to compute a new PageRank vector $\mathbf{P}$ for all components, we discuss an approach to update partial components of $\mathbf{P}$. We need the following well-known formula:


Lemma 2.1 (Sherman-Morrison’s formula, 1949) If A is invertible and x, y are two vectors such that $y^T A^{-1} x \ne -1$, then $A + xy^T$ is invertible and

$$(A + xy^T)^{-1} = A^{-1} - \frac{A^{-1} x y^T A^{-1}}{1 + y^T A^{-1} x}.$$
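
As a quick sanity check of Lemma 2.1, the sketch below compares both sides of the formula on a random example. The matrix size, the random seed and the diagonal shift that keeps the matrices invertible are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# A strictly diagonally dominant matrix, hence invertible; the shift of 10 also
# keeps A + x y^T diagonally dominant for entries drawn from [0, 1).
A = 10.0 * np.eye(n) + rng.random((n, n))
x = rng.random(n)
y = rng.random(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + y @ A_inv @ x                      # must differ from 0
lhs = np.linalg.inv(A + np.outer(x, y))          # direct inverse of the rank-one update
rhs = A_inv - (A_inv @ np.outer(x, y) @ A_inv) / denom
```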


Assume that we have an updated $\ell(v_i, v_j)$ for a fixed integer j and let $M_+$ be the updated adjacency matrix. Let $n = n(j)$ and $n_+ = n_+(j)$ be the previous and updated numbers $L(v_j)$ of links from webpage $v_j$. Let us focus on the case that only more webpage links are added to the existing links from the webpage $v_j$. In this case $n_+ > n$. Hence, we have

$$M_+ = M + x_j v_j^T,$$

where $x_j = [\ell_+(v_i, v_j) - \ell(v_i, v_j),\ i = 1, \cdots, N]^T$, with $\ell_+(v_i, v_j)$ being the updated version of $\ell(v_i, v_j)$, and

$$\|M - M_+\|_1 = n\left(\frac{1}{n} - \frac{1}{n_+}\right) + \frac{n_+ - n}{n_+} \le 2\,\frac{n_+ - n}{n_+}.$$

We shall use $P_+$ to be the updated PageRank vector satisfying

$$P_+ = \frac{1-d}{N}\mathbf{1} + dM_+ P_+.$$


By Theorem 2.2, we know $I - dM_+$ is invertible. Since $I - dM_+ = (I - dM) - d\,x_j v_j^T$, we apply Lemma 2.1 with $A = I - dM$, $x = -d\,x_j$ and $y = v_j$. As both $I - dM$ and $I - dM_+$ are invertible, we know $d\,v_j^T (I - dM)^{-1} x_j \ne 1$ and hence, letting $a = 1 - d\,v_j^T (I - dM)^{-1} x_j \ne 0$,

$$(I - dM_+)^{-1} = (I - dM)^{-1} + \frac{d}{a}(I - dM)^{-1} x_j v_j^T (I - dM)^{-1}. \tag{10}$$

It follows

$$\begin{aligned}
P_+ &= (I - dM_+)^{-1}\frac{1-d}{N}\mathbf{1} \\
&= (I - dM)^{-1}\frac{1-d}{N}\mathbf{1} + \frac{d}{a}(I - dM)^{-1} x_j v_j^T (I - dM)^{-1}\frac{1-d}{N}\mathbf{1} \\
&= P + \frac{d}{a}(I - dM)^{-1} x_j v_j^T P \\
&= P + \frac{d P_j}{a}(I - dM)^{-1} x_j,
\end{aligned} \tag{11}$$

where $P_j$ is the jth component of vector P. Let us write $e = -\frac{d P_j}{a}(I - dM)^{-1} x_j$ for the update for P to obtain $P_+ = P - e$, and we next discuss how to find the update efficiently, as computing $(I - dM)^{-1} x_j$ directly is expensive.

We need the following (cf. [27], p. 198)

Lemma 2.2 (Neumann Series) If A is an n × n matrix and $\|A\|_1 < 1$, then $I - A$ is invertible and

$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k.$$

As shown before, $\|M\|_1 = 1$ and $d \in (0, 1)$, so letting $A = dM$ in Lemma 2.2 above, we have

$$(I - dM)^{-1} x_j = \sum_{k=0}^{\infty} d^k M^k x_j.$$

Algorithm 2.1 (A Single Update) Choose an integer K large enough such that $d^K$ is within a given tolerance. Start with $y = x_j$ and for $k = 1, \cdots, K - 1$, we compute

$$y = dMy + x_j.$$

Then writing $y = [y_1, \cdots, y_N]^T$, we let $a = 1 - d\,y_j$ and $e = -(d P_j / a)\,y$. Thus, by (11), $P_+ \approx P - e$ within the given tolerance.
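
The single update can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's implementation: the five-page web and the helper names `link_matrix`, `pagerank` and `single_update` are hypothetical, and the sketch uses the identity $I - dM_+ = (I - dM) - d\,x_j v_j^T$, so the Sherman-Morrison denominator is taken as $1 - d\,y_j$ with $y \approx (I - dM)^{-1} x_j$ computed from the truncated Neumann series.

```python
import numpy as np

def link_matrix(out_links, N):
    """Column j holds 1/L(v_j) in the rows of the pages that page j links to."""
    M = np.zeros((N, N))
    for j, targets in out_links.items():
        for i in targets:
            M[i, j] = 1.0 / len(targets)
    return M

def pagerank(M, d):
    """Solve P = (1-d)/N * 1 + d * M * P directly (used only as a reference)."""
    N = M.shape[0]
    return np.linalg.solve(np.eye(N) - d * M, np.full(N, (1.0 - d) / N))

def single_update(M, P, x_j, j, d, tol=1e-12):
    """Approximate the updated PageRank after column j of M changes by x_j."""
    K = int(np.ceil(np.log(tol) / np.log(d)))   # smallest K with d**K <= tol
    y = x_j.copy()
    for _ in range(K):
        y = d * (M @ y) + x_j                   # partial sums of sum_k d^k M^k x_j
    a = 1.0 - d * y[j]                          # Sherman-Morrison denominator
    return P + (d * P[j] / a) * y

# Hypothetical 5-page web; page 1 gains two new out-links (to pages 3 and 4).
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2], 4: [3]}
N, d, j = 5, 0.85, 1
M = link_matrix(out_links, N)
P = pagerank(M, d)

out_links_new = dict(out_links)
out_links_new[j] = [2, 3, 4]
M_plus = link_matrix(out_links_new, N)
x_j = M_plus[:, j] - M[:, j]                    # only column j changes

P_plus_fast = single_update(M, P, x_j, j, d)
P_plus_ref = pagerank(M_plus, d)                # full recomputation for comparison
```

In this example n = 1 and n_+ = 3, so the 1-norm of `x_j` is 2(n_+ - n)/n_+ = 4/3, and the update reproduces the fully recomputed vector up to the chosen tolerance. The cost is K matrix-vector products with the old matrix M rather than a new linear solve.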

The above algorithm leads to sequentially updating P, i.e. updating one webpage, one keyword or one document at a time. See Remark 6.2 for a parallel update scheme.

3  Sociological  Analysis 

Sociologists have studied linkages and networks for a long time. Many concepts of centrality measures have been created. In [7], Bonacich proposed a measure of centrality $c(\alpha, \beta)$ with normalization factor $\alpha$ and parameter $\beta$, which is a function of the unit vectors of $\mathbb{R}^N$. For each unit vector, say, $v_i$,

$$c_i(\alpha, \beta) = \sum_j (\alpha + \beta c_j(\alpha, \beta)) R_{ij}, \tag{12}$$

where $R_{ij}$ is an entry of the relation matrix R, which may not be symmetric and whose main diagonal elements are zero. In matrix notation

$$c(\alpha, \beta) = \alpha (I - \beta R)^{-1} R \mathbf{1}. \tag{13}$$

See page 1173 of [7]. If we use the adjacency matrix for R and normalization factor $\alpha = (1 - d)/N$, we see that (12) is just (2) when $\beta = d$, the damping factor. If $\alpha = (1 - d)/N$ and $R\mathbf{1} = \mathbf{1}$, we can see that (13) is the same as the equation on the right-hand side of (8). Using these two parameters, the PageRank is just $c(\alpha, \beta)$. According to Bonacich, when R is asymmetric, e.g., R = M, $c(\alpha, \beta)$ measures prestige if $\beta > 0$. This gives a sociological justification that Google’s PageRank does measure the importance and relevance scores of each webpage.
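
The step from the componentwise definition (12) to the matrix form (13) can be verified numerically. The sketch below is only an illustration: the 4 × 4 relation matrix, the values of alpha and beta, and the function name `bonacich_centrality` are assumptions, chosen so that $I - \beta R$ is invertible.

```python
import numpy as np

def bonacich_centrality(R, alpha, beta):
    """Matrix form (13): c = alpha * (I - beta * R)^{-1} R 1."""
    N = R.shape[0]
    return alpha * np.linalg.solve(np.eye(N) - beta * R, R @ np.ones(N))

# Hypothetical asymmetric relation matrix with zero diagonal.
R = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
])
alpha, beta = 1.0, 0.2      # beta times the largest row sum of R is below 1

c = bonacich_centrality(R, alpha, beta)

# Componentwise form (12): c_i = sum_j (alpha + beta * c_j) * R_ij.
componentwise = np.array([
    sum((alpha + beta * c[j]) * R[i, j] for j in range(R.shape[0]))
    for i in range(R.shape[0])
])
```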

Note that when R is symmetric, $c(\alpha, \beta)$ measures centrality. Also, as in [7], $\beta$ can be negative, for example, in a communication network.

As pointed out by Bonacich, his measure $c(\alpha, \beta)$ seems hopelessly ambiguous.


4  Improved  Search  Algorithms 

We  now  explain  the  improvements  discussed  in  the  previous  section  in  mathematical  terminology.  Recall 
all  the  notations  and  definitions  in  a  previous  section.  We  use  page  to  denote  a  webpage,  a  keyword  or 
phrase, a document, etc.


4.1  Convergence  Analysis  of  In-  and  Out-Linkage  Search  Algorithm 

For each page v, $A_v$ is the collection of all pages connected from v. We now search all $u \in A_v$ to see if u connects to v also. We denote by $A_v^+$ the collection of such pages $u \in A_v$. Note that $A_v^+$ could be an empty set if v is an unknown page to the world. In any case, $A_v^+ \subset A_v$. Let

$$L(v) = \#(A_v) + \#(A_v^+), \tag{14}$$

where #(A) stands for the number of entries in set A, called the cardinality of A. We define the new adjacency entry

$$\ell(v, u) = \begin{cases} 1/L(v), & \text{if } u \in A_v \text{ but } u \text{ is not connected to } v, \\ 2/L(v), & \text{if } u \in A_v \text{ and } u \text{ is connected to } v, \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$


According to our Plan 1, the new PageRank function P(v) is now defined by

$$P(v) = d \sum_{u \in A_v} P(u)\,\ell(v, u) + \frac{1-d}{N}, \quad \forall \text{ any unit vector } v \in \mathbb{R}^N, \tag{16}$$

where $d \in (0, 1)$ is a damping factor. To find the value of P(v), we use a similar iterative algorithm as in Section 2. Let

$$M = [\ell(v_i, v_j)]_{1 \le i, j \le N}. \tag{17}$$

We are now ready to define

Algorithm 4.1 (In- and Out-linkage Search Algorithm) Starting from $P_0 = \mathbf{1}/N$, we iteratively compute

$$P_k = \frac{1-d}{N}\mathbf{1} + dMP_{k-1}, \quad \forall k = 1, 2, \ldots. \tag{18}$$

For convenience, we write P for the new PageRank vector of all pages $v_i \in \mathbb{R}^N$, $i = 1, \cdots, N$.

Based on the analysis in Section 2, we can conclude the following

Theorem 4.1 Let P be the PageRank vector defined above. For all $k \ge 1$,

$$\|P_k - P\|_1 \le C d^k,$$

where C is a positive constant independent of k.
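
The in- and out-linkage weights can be sketched directly from their definition. The example assumes the reading of (14) and (15) in which a page u linked from v gets weight 2/L(v) when u also links back to v and 1/L(v) otherwise, with L(v) = #(A_v) + #(A_v^+); the four-page link structure and the function name `in_out_weights` are hypothetical.

```python
def in_out_weights(out_links):
    """Return {v: {u: weight of u for page v}} following (14) and (15)."""
    weights = {}
    for v, targets in out_links.items():
        A_v = set(targets)                                   # pages connected from v
        A_v_plus = {u for u in A_v if v in out_links.get(u, ())}  # those linking back
        L_v = len(A_v) + len(A_v_plus)                       # L(v) as in (14)
        weights[v] = {u: (2.0 if u in A_v_plus else 1.0) / L_v for u in A_v}
    return weights

# Hypothetical link structure: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
weights = in_out_weights(out_links)
row_sums = {v: sum(row.values()) for v, row in weights.items()}
```

Here page "a" links to "b" and "c", and only "c" links back, so L(a) = 3 and the weights are 1/3 for "b" and 2/3 for "c". For every page with at least one out-link the weights sum to one.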


4.2  Convergence  Analysis  of  Second  Order  linkage  Search  Algorithm 

Next we study our improved search algorithm 2. In addition to using $A_v^+$, we count the links from each $u \in A_v$. That is, we look at $A_u$. The more links there are in $A_u$, the smaller the contribution of u, according to the sociological justification in the previous section. Thus, we define

$$n(v) = \#(A_v) + \sum_{u \in A_v} \frac{1}{\#(A_u)} \tag{19}$$

and

$$\tilde\ell(u, v) = \begin{cases} \left(1 + \dfrac{1}{\#(A_u)}\right)\dfrac{1}{n(v)}, & \text{if } u \in A_v, \\ 0, & \text{otherwise.} \end{cases} \tag{20}$$

It is easy to see

Lemma 4.1 For any page v,

$$\sum_{j=1}^{N} \tilde\ell(v_j, v) = 1. \tag{21}$$

Thus the new PageRank function P(v) is now defined by

$$P(v) = d \sum_{u \in A_v} P(u)\,\tilde\ell(u, v) + \frac{1-d}{N}, \quad \forall \text{ any unit vector } v \in \mathbb{R}^N, \tag{22}$$


where $d \in (0, 1)$ is a damping factor. To find the value of P(v), we do the iterations as before. Letting

$$M = [\tilde\ell(v_i, v_j)]_{1 \le i, j \le N}, \tag{23}$$

we have

Algorithm 4.2 (Second Order Linkage Search Algorithm) Starting from $P_0 = \mathbf{1}/N$, we iteratively compute

$$P_k = \frac{1-d}{N}\mathbf{1} + dMP_{k-1}, \quad \forall k = 1, 2, \ldots. \tag{24}$$

Similarly, we write P for the new PageRank vector of all pages $v_i \in \mathbb{R}^N$, $i = 1, \cdots, N$.

Based on the analysis in Section 2, we can conclude the following

Theorem 4.2 Let P be the PageRank vector defined above. For all $k \ge 1$,

$$\|P_k - P\|_1 \le C d^k,$$

where C is a positive constant independent of k.
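
The normalization in Lemma 4.1 can be illustrated with a short sketch. It assumes the second order weight of u for page v is $(1 + 1/\#(A_u))/n(v)$ with $n(v) = \#(A_v) + \sum_{u \in A_v} 1/\#(A_u)$, and that every page has at least one out-link; the link structure and the function name `second_order_weights` are hypothetical.

```python
def second_order_weights(out_links):
    """Return {v: {u: (1 + 1/#(A_u)) / n(v)}} for u in A_v.

    Assumes every page appearing as a target has at least one out-link.
    """
    weights = {}
    for v, targets in out_links.items():
        A_v = set(targets)
        raw = {u: 1.0 + 1.0 / len(out_links[u]) for u in A_v}
        n_v = sum(raw.values())        # equals #(A_v) + sum over u in A_v of 1/#(A_u)
        weights[v] = {u: w / n_v for u, w in raw.items()}
    return weights

# Hypothetical link structure: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
weights = second_order_weights(out_links)
sums = {v: sum(row.values()) for v, row in weights.items()}
```

For page "d", the linked page "a" has two out-links and "c" has one, so n(d) = 3.5 and the weights are 3/7 for "a" and 4/7 for "c": the page with more out-links contributes less.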

5  Yahoo  Search  Engine 

The Yahoo search engine is quite different from the Google search engine conceptually and hence, computationally. The following description of the Yahoo search engine is based on [38].

Yahoo emphasizes the similarity of any two webpages, two documents, or two web hubs. There are many sources of similarity information including link similarity, text similarity, multimedia component similarity, click through similarity, title similarity, URL similarity, cache log similarity, etc. For simplicity, let us consider the search engine for documents. Even for documents, there are many categories of similarity such as title similarity, author similarity, text similarity, etc. For example, the text similarity is the dot product of two vectors which record the frequencies of usage of various words. That is, suppose that two documents use M words in total. For one document, let x be the vector of length M whose entries are the frequencies of usage of each word in the document, and similarly let y be the vector for the other document. Then x · y denotes the similarity between these two documents.
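
The text similarity just described can be sketched as follows. This is a toy illustration that assumes raw word counts as the frequencies and whitespace tokenization; the function names `word_frequencies` and `text_similarity` are hypothetical.

```python
from collections import Counter

def word_frequencies(text, vocabulary):
    """Vector of word counts of `text`, ordered by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def text_similarity(doc_a, doc_b):
    """Dot product x . y of the word-count vectors of two documents."""
    vocabulary = sorted(set(doc_a.lower().split()) | set(doc_b.lower().split()))
    x = word_frequencies(doc_a, vocabulary)
    y = word_frequencies(doc_b, vocabulary)
    return sum(xi * yi for xi, yi in zip(x, y))

similarity = text_similarity("web search web", "search engine")
```

In this example the shared vocabulary is (engine, search, web), the vectors are (0, 1, 2) and (1, 1, 0), and the similarity is 1. Documents with no common words get similarity 0.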

Suppose that the base set of all on-line documents contains N documents. The global similarity matrix is $W = [w_{ij}]_{1 \le i, j \le N}$ with $w_{ij} \ge 0$ representing the strength of the similarity between document i and document j. Assume that $w_{ij} \le M$ for all $i, j = 1, 2, \cdots, N$. Suppose that there are K categories of various similarities. For each $k = 1, \cdots, K$, let $x^{(k)} = [x_{1k}, x_{2k}, \cdots, x_{Nk}]^T$ be a vector with nonnegative entries satisfying

$$\sum_{i=1}^{N} x_{ik} = 1.$$

It is a vector of scores or confidence values for all documents with respect to the kth category of similarity. Note that the global similarity matrix combines all sources of similarity information together. Thus, $w_{ij} = f(w_{ij}(1), \cdots, w_{ij}(K))$ with $w_{ij}(k)$, $k = 1, \cdots, K$, being the similarity between document i and document j with respect to the category k. Yahoo considers a global similarity objective function

$$P(x) = \sum_{k=1}^{K} (x^{(k)})^T W x^{(k)} = \sum_{k=1}^{K} \sum_{i,j=1}^{N} x_{ik}\, w_{ij}\, x_{jk}.$$

The Yahoo engine first has to do data preparation or preprocessing by generating the global similarity matrix W based on training and learning, and then compute the matrix x to maximize the objective function value by solving the following maximization problem:

$$\max_{x^{(k)} \in C,\ 1 \le k \le K} P(x), \tag{25}$$

where $x^{(k)} = [x_{1k}, x_{2k}, \cdots, x_{Nk}]^T$ and $C = \{(x_1, x_2, \cdots, x_N) : x_i \ge 0,\ i = 1, \cdots, N,\ \sum_{i=1}^{N} x_i = 1\}$ is a convex set in $\mathbb{R}^N$. This solution matrix x determines the search results according to the numerical values in the entries of x. For document i, the document j is most similar to document i if the distance between vectors $x_{ik}$, $k = 1, \cdots, K$ and $x_{jk}$, $k = 1, \cdots, K$ is the smallest.

Since P(x) is a quadratic function of x and hence continuous, P(x) achieves its maximum within the closed and bounded set C. It is easy to see

Lemma 5.1 Let $M = \max_{1 \le i, j \le N} w_{ij}$ be the maximum of the entries of the global similarity matrix. Then $P(x) \le MK$ for all $x^{(k)} \in C$, $k = 1, \cdots, K$.

Proof. Since $\sum_{i=1}^{N} x_{ik} = 1$ and $x_{ik} \ge 0$ for all $i = 1, \cdots, N$, we have

$$P(x) = \sum_{k=1}^{K} \sum_{i,j} x_{ik}\, w_{ij}\, x_{jk} \le \sum_{k=1}^{K} \sum_{i,j} M x_{ik} x_{jk} = M \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{j=1}^{N} x_{jk} x_{ik} = MK.$$

However, P(x) may achieve its maximum at several vectors. For example, letting K = 1, N = 3 and W be the identity matrix of size 3 × 3, we know the maximum is 1, attained at x = (1, 0, 0), (0, 1, 0) and (0, 0, 1). Certainly, this is just a pathological example as no one uses an identity matrix for the global similarity matrix. The algorithm Yahoo uses produces a unique maximizer each time, as will be discussed below.

To solve the maximization problem, Yahoo uses the Growth Transformation algorithm (GTA). The most important advantages of this algorithm are that it guarantees a monotonically increasing sequence of objective function values and that the computation is stable, as each step is a convex combination of the previous step. For simplicity, we assume that K = 1. In fact, the Yahoo engine loops through each category using the following GTA.


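Algorithm 5.1 (the Growth Transformation) Starting with a trained vector $x^{(1)}$ in C, for $k = 1, 2, \cdots$ one computes

$$x_j^{(k+1)} = \frac{x_j^{(k)}\, \dfrac{\partial P}{\partial x_j}(x^{(k)})}{\sum_{i=1}^{N} x_i^{(k)}\, \dfrac{\partial P}{\partial x_i}(x^{(k)})} \tag{26}$$

for $j = 1, \cdots, N$.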
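
For K = 1 and a symmetric W, the growth transformation takes only a few lines. The sketch below is an illustration under assumptions, not Yahoo's implementation: the 3 × 3 similarity matrix, the starting vector and the function name `growth_transformation` are hypothetical, and the gradient is taken as 2Wx, which holds for symmetric W.

```python
import numpy as np

def growth_transformation(W, x0, num_iter):
    """Iterate (26) for P(x) = x^T W x with symmetric nonnegative W.

    Returns the final vector and the list of objective values along the way.
    """
    x = np.asarray(x0, dtype=float)
    values = [x @ W @ x]
    for _ in range(num_iter):
        grad = 2.0 * (W @ x)           # dP/dx_j for symmetric W
        x = x * grad / (x @ grad)      # x_j * dP/dx_j / sum_i x_i * dP/dx_i
        values.append(x @ W @ x)
    return x, values

# Hypothetical symmetric similarity matrix with nonnegative entries.
W = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
x0 = np.array([0.2, 0.3, 0.5])         # a starting point in the simplex C
x, values = growth_transformation(W, x0, 50)
```

Each iterate stays in C because the update rescales the entries so that they sum to one, and the recorded objective values never decrease, while remaining below the largest entry of W as in Lemma 5.1 with K = 1.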


This algorithm is mainly based on an inequality proved in [5]. For convenience, we present a short version of the proof for the degree d = 2.

Lemma 5.2 (Baum and Eagon, 1967) Let P(x) be a homogeneous polynomial of degree d with nonnegative coefficients. Let C be the convex set defined above. For any $x \in C$, let

$$y = (y_1, \cdots, y_N) \quad \text{with} \quad y_j = \frac{x_j\, \dfrac{\partial P}{\partial x_j}(x)}{\sum_{i=1}^{N} x_i\, \dfrac{\partial P}{\partial x_i}(x)}, \quad j = 1, \cdots, N.$$

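Then $P(y) > P(x)$ unless $y = x$.

Proof. For convenience, let us use $P(x) = x^T W x$ with W symmetric to prove this inequality, i.e., we assume that d = 2. Clearly, P is a homogeneous polynomial of degree d = 2. Since the entries of W are nonnegative, P(x) satisfies the assumptions of this lemma. Note that

$$\frac{\partial}{\partial x_i} P(x) = 2 \sum_{j=1}^{N} w_{ij} x_j \quad \text{and} \quad \sum_{i=1}^{N} x_i \frac{\partial}{\partial x_i} P(x) = 2P(x).$$

Thus, we first use Hölder's inequality (with exponents 3 and 3/2) and then the geometric and arithmetic mean inequality to have

$$\begin{aligned}
P(x) = \sum_{i,j=1}^{N} w_{ij} x_i x_j &= \sum_{i,j=1}^{N} (w_{ij} y_i y_j)^{1/3} \left( w_{ij} x_i x_j \left(\frac{x_i x_j}{y_i y_j}\right)^{1/2} \right)^{2/3} \\
&\le \left( \sum_{i,j=1}^{N} w_{ij} y_i y_j \right)^{1/3} \left( \sum_{i,j=1}^{N} w_{ij} x_i x_j \left(\frac{x_i x_j}{y_i y_j}\right)^{1/2} \right)^{2/3} \\
&\le \left( \sum_{i,j=1}^{N} w_{ij} y_i y_j \right)^{1/3} \left( \sum_{i,j=1}^{N} w_{ij} x_i x_j\, \frac{1}{2}\left(\frac{x_i}{y_i} + \frac{x_j}{y_j}\right) \right)^{2/3}.
\end{aligned}$$

Since $y_i = x_i \sum_{j=1}^{N} w_{ij} x_j / P(x)$, we have

$$\begin{aligned}
\sum_{i,j=1}^{N} w_{ij} x_i x_j\, \frac{1}{2}\left(\frac{x_i}{y_i} + \frac{x_j}{y_j}\right) &= \frac{1}{2} \sum_{i,j=1}^{N} w_{ij} x_i x_j \left( \frac{P(x)}{\sum_{k=1}^{N} w_{ik} x_k} + \frac{P(x)}{\sum_{k=1}^{N} w_{kj} x_k} \right) \\
&= \frac{P(x)}{2} \left( \sum_{i=1}^{N} x_i \sum_{j=1}^{N} w_{ij} x_j \Big/ \Big(\sum_{k=1}^{N} w_{ik} x_k\Big) + \sum_{j=1}^{N} x_j \sum_{i=1}^{N} w_{ij} x_i \Big/ \Big(\sum_{k=1}^{N} w_{kj} x_k\Big) \right) \\
&= P(x)
\end{aligned}$$

since $x \in C$. Combining the above discussion together, we have

$$P(x) \le \left( \sum_{i,j=1}^{N} w_{ij} y_i y_j \right)^{1/3} (P(x))^{2/3}.$$

It thus follows that $P(x) \le \sum_{i,j=1}^{N} w_{ij} y_i y_j = P(y)$. Note that from the above discussion, the inequalities become equalities for Hölder's inequality and the geometric and arithmetic mean inequality when y = x. Therefore, we have a strict inequality when $y \ne x$. We have thus completed the proof. ■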
With this lemma, we are able to show the convergence of Algorithm 5.1. Due to the monotonically increasing property of $P(x^{(k)})$ and the boundedness of P(x) by Lemma 5.1, we see that $P(x^{(k)})$ is convergent to $P(x^*)$ for some point $x^*$. In practice, the iterations may end in a finite number of steps. Note that for $x \in C$, $\nabla P(x) \ne 0$ and each component of $\nabla P(x)$ is strictly positive. Thus, objective function values are increasing until P achieves its maximum. In this case, by Lemma 5.2, we have $x^{(n+1)} = x^{(n)}$ for some integer n.

Also, $x^{(k)}$, $k \ge 1$, are all in C and hence are bounded. It follows that there exists a convergent subsequence. Due to the nonuniqueness of the maximizers, we cannot conclude the convergence of the whole sequence.

Suppose that W is positive definite, so that P(x) is a convex function. When x and y defined in Lemma 5.2 are close enough, we have

$$\nabla P(x) \cdot (y - x) \ge 0 \tag{27}$$

since $P(y) \ge P(x)$. It follows that

$$P(y) - P(x) = P(y - x) + 2(y - x)^T W x = P(y - x) + (y - x)^T \nabla P(x) \ge \lambda_n \|y - x\|^2, \tag{28}$$

where $\lambda_n$ denotes the smallest eigenvalue of W and $\|z\|$ denotes the $\ell_2$ norm of vector $z \in \mathbb{R}^N$. Therefore we have


Theorem 5.1 Suppose that W is symmetric and positive definite. Let M be the largest value of the entries of W. Let $\lambda_n$ be the smallest eigenvalue of W. Let $x^{(k)}$, $k \ge 1$, be the sequence from Algorithm 5.1. Then there exists a convergent subsequence, say $x^{(m_k)}$, $k \ge 1$, which converges to a maximizer $x^*$ and satisfies

$$\sum_{k=1}^{\infty} \|x^{(k+1)} - x^{(k)}\|^2 \le C < \infty,$$

where C is a positive constant dependent on $\lambda_n$ and M.

Proof. We have discussed that the sequence $x^{(k)}$ has a convergent subsequence. For convenience, let us say $x^{(k)}$ converges to $x^*$. Thus, $x^{(k+1)}$ and $x^{(k)}$ will be very close when $k \ge K_0$ for a positive integer $K_0$. Thus,

$$P(x^{(k+1)}) - P(x^{(k)}) \ge P(x^{(k+1)} - x^{(k)}) \ge \lambda_n \|x^{(k+1)} - x^{(k)}\|^2$$

by using (28) and (27). Hence,

$$\lambda_n \sum_{k \ge K_0} \|x^{(k+1)} - x^{(k)}\|^2 \le M - P(x^{(K_0)}) \le M < \infty.$$

This completes the proof. ■

Although Yahoo does not discuss much of the mathematics behind generating the global similarity matrix W, one approach called matrix completion can be used. Matrix completion is a research topic in mathematics which has recently been actively studied. It starts from the well-known Netflix problem. Indeed, Netflix (cf. [36]) made publicly available a set of data with about $10^5$ known entries of its movie rating matrix of size about $5 \times 10^5$ by $2 \times 10^5$ and challenged the research community to complete the movie recommending matrix with root mean square error less than 0.8563. This matrix completion problem has been studied actively ever since. We refer to the papers [26], [41], and [29] and the references therein for in-depth study.

6 Remarks


We have the following remarks in order:

Remark 6.1 We will not discuss other link-based ranking systems for webpages such as the HITS algorithm invented by Jon Kleinberg (used by Ask.com), the IBM CLEVER project, IDD Information Services designed by Robin Li, who founded Baidu in China, the TrustRank algorithm, etc.

Remark 6.2 Next we discuss how to update the PageRank in parallel for multiple webpages, keywords and names of documents at a time. We need the following

Lemma 6.1 (Woodbury matrix identity) Suppose that A is invertible and C is also invertible. If $A + UCV$ is invertible, then

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}.$$

The above formula can be rewritten in the following form

Lemma 6.2 Suppose that U and V are unitary matrices and C is a diagonal matrix containing nonnegative values as entries. If $A + UCV^T$ is invertible, then

$$(A + UCV^T)^{-1} = A^{-1} - (VC^{\dagger}U^T A + I)^{-1}A^{-1}, \tag{29}$$

where $C^{\dagger}$ is the pseudo-inverse of C.

Proof. First of all, we have

$$(A + UCV^T)^{-1} = A^{-1} - A^{-1}U(C^{\dagger} + V^T A^{-1}U)^{-1}V^T A^{-1}$$

by using the same direct proof as for the Woodbury formula. Furthermore, we use the invertibility of U, V and A to rewrite the right-hand side of the above formula to get the equality in (29). ■
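
Both forms of the inverse can be compared numerically. The sketch below covers only the case where C is invertible, so that the pseudo-inverse of C is its ordinary inverse; the matrix size, the random seed, the range of the diagonal entries of C and the diagonal shift in A are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = 10.0 * np.eye(n) + rng.random((n, n))            # well-conditioned, invertible
U, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal matrix
V, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal matrix
C = np.diag(rng.uniform(0.5, 1.5, size=n))           # positive diagonal, invertible

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)                             # equals the pseudo-inverse here

direct = np.linalg.inv(A + U @ C @ V.T)
# Woodbury identity (Lemma 6.1) with the third factor V^T.
woodbury = A_inv - A_inv @ U @ np.linalg.inv(C_inv + V.T @ A_inv @ U) @ V.T @ A_inv
# Rewritten form (29).
form_29 = A_inv - np.linalg.inv(V @ C_inv @ U.T @ A + np.eye(n)) @ A_inv
```

The rewritten form needs one inverse of an n × n matrix built from A directly, instead of the nested inverse in the Woodbury identity.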

For the Google matrix $A = I - dM$ and $UCV^T = -d\sum_{j=1}^{k} x_j v_j^T$, we know most entries of the matrix $\sum_{j=1}^{k} x_j v_j^T$ of size N × N are zero. There is a nonzero block which is of size $n_k \times n_k$ with integer $n_k \ll N$ if k is not very big. For convenience, let us say the principal block of size $n_k \times n_k$ is nonzero. It is easy to find the singular value decomposition $U_1 C_1 V_1^T$ of this block. Then let $U = \mathrm{diag}(U_1, I_{N-n_k})$, $C = \mathrm{diag}(C_1, 0_{N-n_k})$ and $V = \mathrm{diag}(V_1, I_{N-n_k})$, where $I_{N-n_k}$ is the identity matrix of size $N - n_k$ and $0_{N-n_k}$ is the zero block of size $N - n_k$. In general $C_1$ is not invertible. We can write it in the following form: if $C_1$ is of rank $r_k < n_k$,

$$C_1 = \begin{bmatrix} C_{11} & 0 \\ 0 & 0 \end{bmatrix},$$

where $C_{11}$ is a diagonal matrix containing all the nonzero singular values of this block.

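From (29), we need to find $(VC^{\dagger}U^T A + I)^{-1}$. First, we have

$$VC^{\dagger}U^T A = \begin{bmatrix} V_1 & 0 \\ 0 & I_{N-n_k} \end{bmatrix} \begin{bmatrix} C_1^{\dagger} & 0 \\ 0 & 0_{N-n_k} \end{bmatrix} \begin{bmatrix} U_1^T & 0 \\ 0 & I_{N-n_k} \end{bmatrix} \begin{bmatrix} I_{n_k} - dM_{11} & -dM_{12} \\ -dM_{21} & I_{N-n_k} - dM_{22} \end{bmatrix}.$$

Writing $\tilde{C}_1 = V_1 C_1^{\dagger} U_1^T$ for the resulting matrix, which is of size $n_k \times n_k$, we then have

$$VC^{\dagger}U^T A = \begin{bmatrix} \tilde{C}_1 & 0 \\ 0 & 0_{N-n_k} \end{bmatrix} \begin{bmatrix} I_{n_k} - dM_{11} & -dM_{12} \\ -dM_{21} & I_{N-n_k} - dM_{22} \end{bmatrix},$$

where the Google matrix M is written in terms of the same division of blocks as $VC^{\dagger}U^T$. Furthermore,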

VC]UT  A  +  I  =  -  dMn)  +  Ink  -CiM12 

[  0  IN-nk 

_'\Ci  0  1  (\dCfl  -dMu  -dM12\  \ 

0  lN-nk  V  0  0  N-nk  ) 


Finally,  we  have 


(ytfUT  A  +  I)-1  =  (l  +  d\Cl-  Mu  ~Ml2] )  1  K  r  0 
V  L  0  ®N-nk\)  [  0  In-i 


It  follows  the  updated  PageRank  vector  is 


P  +  =  (I-  dM  -  dUCVT)~ 

=  (I-dM)-1^l-  (i  +  d 


C{ -Mn  -Mi21\  1  \dC\  0 


0  On-t 


0  In-i 


(I  -  dM  f 


=  P-  [I  +  d 


C\  —  Mu  ~Ml2]\  1  \dC\  0 


0  0/v— r 


0  In-i: 


We  are  now  ready  to  present  an  algorithm  for  multiple  updates. 

Algorithm 6.1 (Multiple Updates) Choose an integer K large enough such that $d^K$ is within a given
tolerance.  We  first  compute  U\,C\,V\  as  explained  above.  Then  compute  C\.  For  the  current  PageRank 
vector  P,  modify  it  to  be 

p  =  dC\  0  p 

0  lN-nk\ 


and  then  compute  the  following  iterations 


P+«P  -JV 

3=0 


j\c\-Mu  -Mi2Y 

0  0  N-nk_ 
Similarly we can discuss how to compute $M_+$ efficiently when adding a new webpage/document/link to the database. We omit the details here.

Remark 6.3 The importance of webpages related to each keyword, document or webpage can also be calculated based on the number of hits from the world. However, this count can very easily be inflated by artificial hits.

1 Background

We are moving rapidly from a society built around relationships in homes, neighborhoods, workplaces,
places  of  worship,  and  voluntary  organizations,  to  a  globally  connected  society  with  interactions  that 
span  large  spatial  and  social  distances.  Sociological  understanding  of  this  transformation  has  yet  to  be 
achieved.  This  transition  can  be  thought  of  as  a  transition  from  strong  to  weak  ties.  Humans  form  direct 
communities  online,  in  the  context  of  social  networking  sites,  email  communications,  and  virtual  worlds, 
etc.  They  also,  however,  form  indirect  communities  via  the  development  of  shared  tastes,  consumption 
patterns,  media  access,  etc.  Understanding  the  character  and  consequences  of  these  direct  and  indirect 
relationships has been a key focus of businesses such as Google, Amazon, Pandora, and Yahoo, but has
been  relatively  under-examined  by  contemporary  computational  social  science. 

In  the  spring  of  2011,  access  to  social  networking  sites  was  largely  hailed  as  a  prime  facilitator  of 
individuals  across  North  Africa  and  the  Middle  East  as  they  organized  expressions  of  social  unrest  and 
political  discontent  on  a  massive  scale  and  then  communicated  the  results  of  those  organized  expressions 
with lightning speed. On the civilian side, the marketing community is keenly interested in exploiting
sociological links, especially the weak sociological links that have recently become emplaced in and accessible
through the Internet. Worldwide, internet usage is increasing at an astounding rate, particularly
the use of social networking sites. The number of adult internet users in the United States doubled between
2008 and 2010 (Hampton, Goulet, Rainie Purcell 2011 [19]). A recent Pew Research Center survey
reported that in the United States, social networking site use has risen from 26% of all adults in 2008
to 47% in 2010 (Rainie, Purcell, Smith 2011 [31]). Most of these users (92%) are on Facebook; 13%
use  Twitter  (Rainie,  Purcell,  Smith  2011).  Evidence  from  this  report  suggests  that  the  movement  from 
organizationally  and  geographically  organized  communities  to  online  communities  has  augmented  rather 
than  supplanted  other  types  of  sociality.  Internet  users  are  even  more  likely  than  the  average  American 
to  belong  to  voluntary  groups  or  organizations  (80%  versus  56%).  Social  networking  site  users  are  even 
more  likely  than  other  internet  users  to  belong  to  organized  groups,  with  Twitter  users  being  the  most 
likely  to  belong  to  voluntary  groups  or  organizations.  Facebook  users  are  also  more  politically  engaged 
than  other  U.S.  adults  (Hampton,  Goulet,  Rainie  Purcell  2011).  So,  while  mediated  interactions  are  taking 
place  at  unprecedented  rates,  they  do  not  seem  to  be  supplanting  other,  more  conventional  forms  of  social 
organization.  Rather,  it  is  likely  that  these  types  of  relations  interact  with  one  another  in  ways  that  are  not 
yet adequately understood.

2 Useful Sociological Concepts

2.1  Tie  Strength  and  System  Size 

Social  network  techniques  for  analyzing  structure  and  position  within  social  systems  largely  developed 
to  understand  strong  sociological  links  (families,  hierarchical/command  organizations,  communities  with 
specific  structure,  nation  states,  etc.).  These  techniques  were  later  applied  to  the  study  of  weak  social 
ties (acquaintances, occasional encounters, etc.), but there has been relatively little comparison of the differences
between the two types of social systems. Strong sociological links are responsible for important
aspects of human society, including but not limited to the ability to build entities that allow for coordinated
action of large numbers of people. Strong sociological links involve loyalty and giving/accepting orders
due to monetary and religious relationships. Weak sociological links, on the other hand, are the most
common sources of new information, which are now increasingly available to collect via the internet. Both
strong  and  weak  links  play  positive  roles  in  human  society  and  in  human  progress.  However,  with  the 
Internet  and  many  other  networking  capabilities,  the  balance  between  strong  and  weak  links  is  changing. 
As  the  accessibility  of  all  links  increases,  the  relative  importance  of  the  weak  links  is  increasing. 

Tie strength is a property of relationships, and generally refers to the intensity of commitment, constraint,
or emotion attached to a particular link. Another way to characterize the difference between conventional
social environments and those in the world of Web 2.0 is in terms of system size and complexity.
While not equivalent, these are somewhat conflated in the scholarly literature. So, when we talk about
research on weak versus strong ties, we primarily are distinguishing between analysis of the relational
structure  of  small,  often  bounded  groups,  versus  analysis  of  large,  complex  social  systems. 

An  advantage  of  studying  small,  bounded  groups  is  the  ability  to  work  with  whole  networks.  Many 
of  the  approaches  to  looking  at  social  structure  and  social  position  in  the  social  network  literature  rely  on 
graph theory and utilize the entire matrix of relations in their computation (Carrington, Scott, and Wasserman
2005). Much of the focus in the contemporary social network environment is on very large scale,
sparse, and complex social systems. It remains an open question whether social network measurement
approaches developed to understand small whole networks are optimal for understanding larger, sparser,
more  complex  communication  networks  or  knowledge  networks  based  on  weak  links.  In  fact,  some  recent 
research  suggests  that  other  dynamics  whose  properties  we  thought  we  understood  in  simpler  networks 
may  operate  differently  in  more  complex  systems.  In  light  of  this,  we  will  survey  the  social  network 
literature  regarding  the  measurement  and  quantification  of  various  indicators  of  social  capital  including 
social cohesion, social position, and social distance and consider their potential applicability to large
complex  systems  of  weak  relations. 

2.2  Social  Cohesion 

Social  cohesion  has  long  been  a  subject  of  investigation  in  the  study  of  groups  (Albert  1953,  Cartwright 
1968,  Lott  1961,  Van  Bergen  and  Koekebakker  1959).  Social  cohesion  refers  to  the  degree  of  solidarity 
within  a  group  or  social  system  and  usually  is  defined  as  the  degree  of  attraction  and/or  commitment 
toward  the  group/system  held  by  individual  members  of  the  group/system.  This  premise  of  an  individual- 
to-group  relation  has  been  the  subject  of  some  debate  (can  individuals  actually  relate  to  groups  or  do  they 
only  relate  to  other  individuals?).  The  literature  largely  supports  the  idea  that  individuals  can,  indeed, 
have relationships with abstract groups and that these relationships can precede and supersede relations
between individuals within those groups (for a review, see Friedkin 2004[?]).

Social network researchers also have used a variety of structural properties of the group to characterize
its cohesion, rather than relying on individual reports of attraction or commitment to the group. These
include the extent of positive ties within the group (Cartwright 1968[?]), the degree of symmetry among
positive ties within the group (Moreno and Jennings 1937[?]), and the density of interpersonal relations
(Festinger et al. 1950[?]). There are also numerous ways of identifying cohesive subgroups within larger
systems  (Wasserman  and  Faust  1994).  These  approaches,  however,  fail  to  capture  the  individual-to-group 
aspect  of  social  cohesion,  and  run  the  risk  of  creating  a  tautology  if  we  want  to  use  structural  features  to 
predict  social  cohesion.  Research  using  more  classic  measures  of  social  cohesion  reflecting  individual 
commitment  to  the  group  finds  that  even  groups  that  are  large,  complexly  differentiated,  sparse,  and 
composed  of  weak  ties  can  be  highly  cohesive  when  they  have  conducive  structures  (Doreian  and  Fararo 
1998).  These  include  reachability  (Markovsky  1998[24])  and  a  low  density  of  negative  or  punishing  ties 
(Friedkin  2003[?]). 

Additional research suggests that it is the repeated activation of positive ties that produces group
cohesion (Lawler, Thye and Yoon 2000; McPherson and Smith-Lovin 2002), rather than simply their quantity.
This finding that the ongoing nature of social relations differentiates them from other kinds of ties
considered  in  isolation  is  related  to  the  embeddedness  approach  in  economic  sociology  (e.g.,  Granovetter 
1985;  Uzzi  1999).  Embeddedness  takes  into  account  the  degree  to  which  mutual  ties,  reachability,  and 
other  opportunities  for  feedback  loops  in  a  network  system  create  additional  pressures  toward  trust  and 
cohesion.  When  a  relationship  is  embedded  within  a  larger  system  of  relationships,  there  is  both  a  larger 
shadow of the past and a larger shadow of the future. This has the effect of creating more enduring relationships
and a greater commitment to the groups or systems in which these relationships are embedded.

In  summary,  social  cohesion  is  understood  as  a  way  of  characterizing  the  stability  and  intensity  of 
the relationship between individual members of a group or system and the group itself. Structural position
(reviewed further below) and embeddedness predict the individual-to-group relationships from which
group-level cohesion derives. Structural conditions like density, reachability, reciprocity, and repetition of
tie activation predict cohesion at the group level. While nodal reach and system reachability are not easily
calculated in large or incomplete systems, reciprocity and repetition of tie activation are network features
that can be accessed with egocentric data from sampled nodes, and so are features that might easily be
used  in  understanding  the  dynamics  of  massive,  complex  systems. 

2.3  Various  Measures  of  Centrality 

The most widely investigated social network measures are those characterizing social position, sometimes
called social prominence or network centrality. In the social network literature, there are four
primary means of characterizing the structural position of a particular node: degree, betweenness, closeness,
and Bonacich power. These are methods of determining the centrality of a vertex within a graph. In
the context of social networks, they are typically used to determine the importance of a particular person
within  the  group  or  system.  The  usefulness  of  each  of  these  methods  of  characterizing  structural  position 
depends  on  features  of  the  system,  the  social  context,  and  the  nature  of  the  resource  flowing  through  the 
graph  or  network.  We  will  describe  these  further  below. 

Degree. The simplest and among the most frequently used measures of structural position is degree-based
centrality. In symmetric networks, this is simply the number of lines or edges connecting a particular
node or vertex to other nodes (Freeman 1979[16]). In simple affiliation systems, this is considered to be a
basic characterization of popularity. A count of one's Facebook friends is a fairly ubiquitous
measure of degree centrality among contemporary college students. While Facebook friendships occur in
a  symmetric  social  network,  many  naturally  occurring  networks  are  fundamentally  asymmetric  in  nature. 
Liking,  respect,  information  seeking,  and  assistance  are  routinely  exchanged  asymmetrically.  In  these 
cases,  it  helps  to  distinguish  between  in-degree  (number  of  ties  received)  versus  out-degree  (number  of 
ties  sent)  to  determine  social  prominence  in  a  network.  In-degree  is  a  more  precise  measure  of  prominence 
when the resource flow is positive and/or deferent (e.g., respect, advice-seeking). Out-degree may be a
more  precise  measure  of  social  prominence  when  the  resource  flow  is  the  diffusion  of  information.  Nodes 
with  high  out-degree  can  serve  as  gatekeepers  in  a  social  system.  Asymmetry  between  out-degree  and 
in-degree  can  also  serve  as  a  measure  of  node  prominence.  When  group  size  is  known,  actor  degree  can 
be  standardized  by  the  group  size  in  order  to  compare  prominence  of  actors  across  groups  (Wasserman 
and  Faust  1994:179). 
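
The in-degree and out-degree counts described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the actor names and advice ties are hypothetical, and standardizing by g - 1 (the number of other actors in a group of size g) is an assumed reading of "standardized by the group size".

```python
def degree_centrality(edges, nodes):
    """Return {node: (in_degree, out_degree)} for directed ties (sender, receiver)."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for sender, receiver in edges:
        out_deg[sender] += 1   # tie sent
        in_deg[receiver] += 1  # tie received
    return {n: (in_deg[n], out_deg[n]) for n in nodes}


def standardized(degree, group_size):
    """Divide by the g - 1 other actors (assumed normalization)."""
    return degree / (group_size - 1)


# Hypothetical advice-seeking network: a tie (x, y) means x seeks advice from y.
nodes = ["ann", "bob", "cat", "dan"]
advice = [("bob", "ann"), ("cat", "ann"), ("dan", "ann"), ("ann", "bob")]
deg = degree_centrality(advice, nodes)
ann_prominence = standardized(deg["ann"][0], len(nodes))
```

Here everyone else seeks advice from ann, so her standardized in-degree reaches its maximum, while her out-degree stays low.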

An elaboration of this approach developed by Bonacich (1987) uses iterative simultaneous equations
to converge on an estimate of power that combines the degree of an actor with information about the actor's
relational neighborhood. This method recognizes that being connected to others with many connections
can increase an actor's importance in a positively connected (contagious) network and simultaneously
decrease one's power in a negatively connected (competitive) network. Imagine a network in which Sally
and Bob each have five friends. Sally's friends each have a high number of other friends; Bob's friends
are isolates and are not friends with many others. If the social process of interest is contagious, like the
diffusion of information, the transfer of disease, or even diffusion of positive regard, then Sally would
have more influence than Bob. She gets her power by being connected to other highly connected others.
In contrast, if the resource flowing across the network is competitive, then Bob may gain more power by
being connected to others who are more dependent upon him for the competitive resource. Bonacich's
algorithm accordingly allows for specifying the level and direction of attenuation in the network.
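
The Sally and Bob example can be checked numerically with a small sketch of the Bonacich (1987) measure, c(alpha, beta) = alpha (I - beta A)^{-1} A 1. The network below is hypothetical (each of Sally's friends is given three further friends, Bob's friends have none), alpha is left at 1, and beta = +0.1 and -0.1 are illustrative values chosen only to be smaller in magnitude than the reciprocal of the largest eigenvalue of A.

```python
import numpy as np


def bonacich_power(adj, beta, alpha=1.0):
    """Solve (I - beta*A) c = alpha * A 1 for the power vector c."""
    n = adj.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - beta * adj, adj @ np.ones(n))


def build_example():
    """Hypothetical network: node 0 is Sally, node 21 is Bob."""
    adj = np.zeros((27, 27))

    def tie(i, j):
        adj[i, j] = adj[j, i] = 1.0  # symmetric friendship

    next_node = 6
    for friend in range(1, 6):       # Sally's five friends
        tie(0, friend)
        for _ in range(3):           # each has three further friends (nodes 6..20)
            tie(friend, next_node)
            next_node += 1
    for friend in range(22, 27):     # Bob's five friends have no other ties
        tie(21, friend)
    return adj


SALLY, BOB = 0, 21
adj = build_example()
contagious = bonacich_power(adj, beta=0.1)    # positively connected network
competitive = bonacich_power(adj, beta=-0.1)  # negatively connected network
```

With the positive beta Sally scores higher than Bob; with the negative beta the order reverses, even though both have degree five.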

Closeness.  A  different  approach  to  quantifying  an  actor’s  power  in  a  social  network  is  closeness-based 
centrality (Freeman 1979[16]). This approach presumes that a node's power is a function of its geodesics.
The  simplest  version  is  simply  an  inverse  of  the  sum  of  all  distances  from  an  actor  to  all  other  actors  in  a 
network.  This  approach  to  power  is  especially  useful  in  positively  connected  (contagious)  networks  across 
which  resources  diffuse  with  some  moderate  rate  of  decay.  An  elaboration  of  the  closeness  centrality 
approach,  called  reach  centrality,  considers  what  portion  of  the  network  an  actor  can  reach  with  each 
additional  number  of  steps  (Hanneman  and  Riddle  2005 [20]). 
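
A minimal sketch of closeness and reach centrality, assuming an undirected, connected network stored as a hypothetical neighbor dictionary. Geodesic distances come from a breadth-first search; the five-actor chain is illustrative only.

```python
from collections import deque


def geodesic_distances(neighbors, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(neighbors, node):
    """Inverse of the sum of geodesic distances to all other reachable actors."""
    total = sum(geodesic_distances(neighbors, node).values())
    return 1.0 / total if total else 0.0


def reach(neighbors, node, steps):
    """Share of the other actors that can be reached within the given number of steps."""
    dist = geodesic_distances(neighbors, node)
    within = sum(1 for d in dist.values() if 0 < d <= steps)
    return within / (len(neighbors) - 1)


# Hypothetical chain a - b - c - d - e
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
middle = closeness(chain, "c")  # distances 1 + 1 + 2 + 2
end = closeness(chain, "a")     # distances 1 + 2 + 3 + 4
```

The middle actor is closer to everyone else than an end actor, and reaches the whole chain in two steps rather than four.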

Betweenness.  A  third  general  approach  to  quantifying  an  actor’s  power  in  a  social  network  relies  on 
betweenness-based centrality (Freeman 1977[15]). This approach recognizes that interactions between
unconnected  members  of  a  network  often  critically  depend  on  other  actors  in  the  system  -  especially  those 
who  lie  on  the  paths  between  the  two.  The  simplest  measure  of  betweenness  centrality  simply  counts  all 
of  the  geodesics  between  all  pairs  of  actors  in  a  system  which  contain  a  particular  actor.  An  elaboration  of 
this  idea,  called  information  centrality,  generalizes  this  to  include  all  paths  between  all  actors,  weighted 
by  the  inverse  of  their  lengths  when  calculating  centrality  (Stephenson  and  Zelen  1989[37]).  This  takes 
into account the idea that information does not always flow along the shortest path, and that an actor can gain
importance  by  controlling  the  flow  across  many  paths,  as  well  as  by  controlling  only  a  few  short  paths. 
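
The geodesic-counting idea behind betweenness can be sketched as below for a small undirected network. One assumption goes beyond the simple count described above: when several geodesics tie between a pair, each intermediate actor is credited with the fraction of those geodesics passing through it, following Freeman's proportional weighting. The example graphs are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(neighbors, source):
    """Return (dist, sigma): geodesic length and number of geodesics from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a geodesic
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(neighbors):
    """Sum, over unordered pairs (s, t), the share of s-t geodesics through each actor."""
    info = {s: bfs_counts(neighbors, s) for s in neighbors}
    score = {v: 0.0 for v in neighbors}
    for s, t in combinations(neighbors, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in neighbors:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:  # v lies on some s-t geodesic
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
path_scores = betweenness(path)
square_scores = betweenness(square)
```

In the three-actor path, b sits on the only geodesic between a and c. In the square, every opposite pair has two tied geodesics, so each actor receives half a point.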

Eigenvector  Centrality.  A  fourth  approach  to  quantifying  structural  position  uses  a  factor  analytic 
procedure  to  discount  closeness  to  small  local  subnetworks  (Bonacich  1972[2]).  This  approach,  called 
eigenvector  centrality,  allows  researchers  to  differentiate  between  proximity  in  the  global  structure  and 
proximity in more local substructures, by computing principal components of the actor distance measures
and generating an eigenvalue for each actor on each structural dimension (Freeman 1979). This
approach generates estimates of structural position that are very close to degree-based centrality when
there is a fairly flat distribution of degree, or in core-periphery structures, where high-degree nodes are
connected primarily to other high-degree nodes (Bonacich 2007[?]). But in structures with many connections
between low-degree and high-degree nodes, the kind of hierarchical clustering that characterizes
most naturally occurring systems (Barabasi et al. 1999; Watts et al. 2002; Watts 2004 [39]), this method
produces distinctively different and much more useful predictions.
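
To illustrate how eigenvector centrality can separate actors that degree treats as equal, here is a minimal sketch. It uses the common formulation as the leading eigenvector of the adjacency matrix, computed by power iteration; that formulation is an assumption here and is not the exact factor-analytic procedure described above. The six-node network is hypothetical, and the iteration adds the identity matrix (which leaves the eigenvectors unchanged) so that it also converges on bipartite graphs.

```python
import numpy as np


def eigenvector_centrality(adj, iterations=500):
    """Power iteration on A + I for a connected, undirected network."""
    n = adj.shape[0]
    shifted = adj + np.eye(n)
    x = np.ones(n)
    for _ in range(iterations):
        x = shifted @ x
        x /= np.linalg.norm(x)
    return x


# Hypothetical network: hub 0 tied to 1, 2, 3; then a chain 3 - 4 - 5.
ties = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
adj = np.zeros((6, 6))
for i, j in ties:
    adj[i, j] = adj[j, i] = 1.0

degree = adj.sum(axis=1)
centrality = eigenvector_centrality(adj)
leading_eigenvalue = centrality @ adj @ centrality
```

Nodes 3 and 4 both have degree two, but node 3 is tied to the hub and so receives the higher eigenvector score.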

Dependence-based  power.  Markovsky  and  colleagues  (Markovsky  et  al.  1994;  1998  [23]  and  [24]) 
have  developed  another  set  of  measures  specifically  designed  to  measure  potential  for  structural  influence 
in  negatively  connected,  or  competitive  networks.  The  simplest  form  of  such  competition  is  when  there 
is  no  resource  flow  (perfect  decay)  and  nodes  are  limited  to  single  exchange  partners.  In  such  networks, 
the structure of relationships in one part of the system can constrain power relations within dyads far away in
the system in systematic ways. The simplest of these graph-theoretic power indices (called GPI) subtracts
the number of disadvantageous paths (the number of unique even-length paths between a node and all
other nodes in the system) from the advantageous paths (the number of unique odd-length paths between
a node and all other nodes) to generate a measure of dependence-based power. Individuals with many
ties to individuals with no other ties (paths of length 1) are rewarded for their ties to dependent others.
These  measures  have  been  used  to  successfully  predict  power  use  and  exchange  outcomes  in  a  number  of 
experimental  studies  of  human  interaction. 

These  measures  of  structural  position  have  varying  utility  by  context.  In  whole  networks  we  can 
easily  see  how  the  decay  rate  of  a  resource  flow  determines  which  measure  of  structural  position  is  most 
useful. When a resource flow is very quickly consumed, degree-based centrality may be the most
useful. We can think of many social behaviors and affects that are like this. If Sally smiles at Bob, her
smile is consumed. Bob can smile at another co-worker, but it won't be Sally's smile. In cases where we
want to understand patterns of single-transfer social behaviors and affects, degree-based centrality may
be sufficient. When a resource flow can travel through one or more nodes, but with some decay factor,
closeness-based  centrality  becomes  more  important.  Information  flows  across  a  network,  but  tends  to 
lose  veracity  and  sometimes  change  (increase  or  decrease)  intensity  with  each  transfer.  Consequently, 
being  closer  to  the  source  provides  one  with  more  accurate  information.  Being  connected  to  many  well- 
connected  others  may  facilitate  more  influence  than  simply  having  the  most  friends.  When  individuals 
serve  as  liaisons  across  various  densely  connected  regions  in  a  system,  they  accrue  betweenness-based 
power,  allowing  them  to  serve  as  gatekeepers  for  resources  that  have  slower  decay  functions. 

Closeness-based centrality, betweenness-based centrality, and dependence-based power require whole
networks for computation. This limits their utility to contexts where full information is available and raises
the question of their computational efficiency when applied to very large, sparse networks. Degree-based
centrality has the benefit of being easy to access and not requiring whole-network information, and so can
easily be used in the context of egocentric data or in large, complex, and sparse networks. Some of the
variants of degree-based centrality, including eigenvector centrality and the variant, PageRank, used by
Google's search engine, can be calculated with only the use of first-order ties, making them computationally
much more efficient in large, sparse, and incomplete networks. With the consideration of one additional
order of relationships (out to 2nd-order ties), the formula could be substantially improved to take into
account the kind of local-distal effects better captured by whole-network approaches such as closeness,
betweenness, and especially dependence-based power.
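
The point about first-order ties can be made concrete with a minimal PageRank-style sketch: each update pass only looks at a page's direct out-links and never needs path lengths or the whole-network geodesic structure. The damping factor 0.85, the uniform redistribution of rank from pages without out-links, and the tiny link graph are all assumptions made for illustration, not details taken from this report.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Iterative PageRank on a dict {page: [pages it links to]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly (assumed convention).
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:  # only first-order ties are consulted
                new[w] += damping * rank[v] / len(out_links[v])
        rank = new
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(web)
top_page = max(rank, key=rank.get)
```

Page c, which receives the most in-links, ends up with the highest rank, and a benefits from being the one page c links to.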



2.4  Social  Distance 

Social  distance  measures  have  been  used  in  sociology  for  nearly  a  century.  Many  of  these  measures,  like 
the classic Bogardus social distance scale (1925), are self-report attitudinal measures. There also is a
long  history,  however,  of  quantifying  social  distance  and  social  positions  using  social  network  techniques. 
Most  of  these  techniques  for  identifying  social  distance  at  the  dyadic  level  are  the  same  as  those  used  to 
identify  social  cohesion  at  the  group  or  system  level.  The  simplest  of  these  techniques  require  that  the 
geodesic  distances  between  the  nodes  in  a  network  or  subnetwork  be  small.  Others  compare  within-group 
ties  to  out-group  ties.  Other  approaches  make  use  of  clustering  or  multi-dimensional  scaling  techniques 
to  represent  social  distance  along  a  small  number  of  dimensions  for  visual  interpretation. 

Some of these techniques focus on identifying sets of structurally equivalent actors. One such approach
uses Pearson's correlation as a measure of structural equivalence and uses the convergence of
iterated correlations between relations as a means of partitioning into subsets. Let us describe this classic
technique using connection through music tastes as an example (Breiger et al. 1975; White et al. 1976).
Take an adjacency matrix A of actors and music purchases. Multiply the matrix A by its transpose A^T
to get an actor-by-actor matrix of people connected through their shared music preferences. Correlate the
rows and columns of this new matrix. Replace the values in the matrix with the results. Repeat until the
cells are filled with 1s and 0s. Separate the 1s and 0s into separate matrices. Replace with values from the
original actor-by-actor matrix and start again. Each successive split will group actors into subgroups with
more similarly shared patterns of relations to others (through shared music preferences). This creates a
binary tree of partitions among actors, with all actors being partitioned into exhaustive and mutually exclusive
subsets. Finally, partition the original actors by the resulting positions and permute the matrix to
reveal  the  relationships  between  the  structurally  equivalent  blocks. 
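
A single split of the iterated-correlation procedure can be sketched as below. This is a rough illustration under stated assumptions: the actor-by-album purchase matrix is hypothetical, only the first split is performed (the recursive re-splitting of each block is omitted), and the iterated correlations are allowed to settle near +1 and -1, with the sign pattern then recoded as a two-way membership.

```python
import numpy as np


def concor_split(matrix, max_iter=100, tol=1e-9):
    """Iterate row correlations until they stabilize; return a boolean block label per actor."""
    corr = np.corrcoef(matrix)
    for _ in range(max_iter):
        updated = np.corrcoef(corr)
        converged = np.max(np.abs(updated - corr)) < tol
        corr = updated
        if converged:
            break
    # Actors positively correlated with actor 0 form one block, the rest the other.
    return corr[0] > 0


# Hypothetical purchases: rows are six actors, columns are five albums.
purchases = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
])
shared = purchases @ purchases.T  # actor-by-actor counts of shared albums
block = concor_split(shared)
```

The first three actors, who share the first albums, land in one block and the last three in the other, even though actors 0, 2 and 3 have one album in common.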

A variant on this technique relies on the Euclidean distances between the ties to and from two
actors, instead of using Pearson's correlations, to capture degree of similarity (Burt 1976). For each pair of
actors i and j, take the Euclidean distance between rows i and j and columns i and j. When two actors are
structurally equivalent (connected to the same other actors), the distance between them will be 0. Once
a  matrix  of  equivalence  relations  has  been  computed,  actors  can  be  partitioned  into  cohesive  subgroups 
through  the  use  of  hierarchical  clustering. 
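
The Euclidean distance measure of structural equivalence can be sketched as follows. The four-actor directed network is hypothetical, and leaving out the entries that involve i and j themselves when comparing the pair is an assumed convention rather than something stated above.

```python
import numpy as np


def structural_distance(adj):
    """Euclidean distance between the tie profiles (rows and columns) of each actor pair."""
    n = adj.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            others = [k for k in range(n) if k not in (i, j)]
            sent = adj[i, others] - adj[j, others]      # ties sent to third parties
            received = adj[others, i] - adj[others, j]  # ties received from third parties
            dist[i, j] = np.sqrt(np.sum(sent ** 2) + np.sum(received ** 2))
    return dist


# Hypothetical network: actors 0 and 1 both send a tie to 2 and receive one from 3.
adj = np.array([
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])
dist = structural_distance(adj)
```

Actors 0 and 1 have identical ties to and from everyone else, so their distance is 0. The resulting matrix could then be handed to any hierarchical clustering routine.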

The correlation-based measure of structural equivalence has the advantage of capturing the equivalence
of  actors  who  have  similar  kinds  of  relations  with  similar  kinds  of  others,  while  the  Euclidean  distance 
measures 

For example, "long ties" (those connecting actors who are socially distant along other social dimensions)
increase diffusion rates in some classic studies, like Granovetter's (1974) study of job seekers, in
which individuals were more likely to find jobs via weak (and long) ties than via strong ties. In other
words, long ties speed simple contagions. Long ties, however, actually slow diffusion of information
when adoption requires multiple affirmations. In these systems, long ties slow complex contagion processes
(Centola and Macy 2007; Centola, Eguíluz and Macy 2007).

3  Conclusion 

The PI and co-PI have immersed themselves in the mathematical and sociological literatures on social networks
and made some initial connections between them. Above, we have briefly summarized the
sociology of social structure, position and influence in strong and weak networks. In the
appended document, we summarize the mathematical study of the Google search algorithm along with
some suggested improvements based on this sociological literature. To further explore and develop
these  connections  and  their  implications  will  require  a  greater  time  investment  than  afforded  by  this  seed 
project. 

With  more  time,  this  research  team  could  more  deeply  digest  the  existing  information,  and  propose 
some  new  algorithms,  making  use  of  sociological  insights  to  improve  mathematically  the  characterization 
of  relations  in  large,  complex  systems.  The  PI  and  co-PI  are  more  than  willing  to  continue  our  joint  work 
toward  a  better  understanding  of  the  searching  algorithms  and  discovering  how  to  better  situate  them  in 
the larger contexts of existing mathematical and sociological knowledge.